Hierarchical Policy Search via Return-Weighted Density Estimation
Authors
Abstract
Learning an optimal policy from a multi-modal reward function is a challenging problem in reinforcement learning (RL). Hierarchical RL (HRL) tackles this problem by learning a hierarchical policy, in which multiple option policies are in charge of different strategies corresponding to modes of the reward function and a gating policy selects the best option for a given context. Although HRL has been demonstrated to be promising, current state-of-the-art methods still cannot perform well in complex real-world problems due to the difficulty of identifying the modes of the reward function. In this paper, we propose a novel method called hierarchical policy search via return-weighted density estimation (HPSDE), which can efficiently identify the modes through density estimation with return-weighted importance sampling. Our proposed method finds option policies corresponding to the modes of the return function and automatically determines the number and the locations of the option policies, which significantly reduces the burden of hyper-parameter tuning. Through experiments, we demonstrate that the proposed HPSDE successfully learns option policies corresponding to modes of the return function and that it can be successfully applied to a challenging motion planning problem of a redundant robotic manipulator.

Introduction

Recent work on reinforcement learning (RL) has been successful in various tasks, including robotic manipulation (Gu et al. 2017; Levine et al. 2016b; Levine et al. 2016a) and playing a board game (Silver et al. 2016). However, many RL methods cannot leverage a hierarchical task structure, whereas many tasks in the real world are highly structured. Grasping is a good example of such a structured task. When grasping an object, humans know multiple grasp types from their experience and adaptively decide how to grasp the given object (Cutkosky and Howe 1990; Napier 1956). This strategy can be interpreted as a hierarchical policy where the gating policy first selects the grasp type and the option policy that represents the selected grasp type subsequently plans the grasping motion (Osa, Peters, and Neumann 2016). Prior work on hierarchical RL suggests that learning various option policies increases versatility (Daniel et al. 2016) and that exploiting a hierarchical task structure can exponentially reduce the search space (Dietterich 2000).

However, RL of a hierarchical policy is not a trivial problem. As indicated by Daniel et al. (2016), each option policy in a hierarchical policy needs to focus on a single mode of the return function; otherwise, the learned policy averages over multiple modes and falls into a local optimum with poor performance. Therefore, it is necessary to properly assign option policies to individual modes of the return function, and the key challenge is identifying the number and locations of the modes of the return function. Although regularizers (Bacon, Harb, and Precup 2017; Florensa, Duan, and Abbeel 2017) can be used to drive the option policies toward various solutions, they cannot prevent an option policy from averaging over multiple modes of the return function. Additionally, in existing methods, the performance often depends on initialization or pre-training of the policy (Daniel et al. 2016; Florensa, Duan, and Abbeel 2017), and a user often needs to specify the number of option policies in advance, which significantly affects the performance (Bacon, Harb, and Precup 2017).
To address such issues in existing hierarchical RL, we propose model-free hierarchical policy search via return-weighted density estimation (HPSDE). Our approach reduces the problem of identifying the modes of the return function to estimating the return-weighted sample density. Unlike previous methods, the number and the locations of the option policies are automatically determined without explicitly estimating the option policy parameters, and an option policy learned by HPSDE focuses on a single mode of the return function. We discuss the connection between expected return maximization and density estimation with return-weighted importance sampling. The experimental results show that HPSDE outperforms a state-of-the-art hierarchical RL method and that HPSDE finds multiple solutions in motion planning for a redundant robotic manipulator.

Problem Formulation

We consider a reinforcement learning problem in the Markov decision process (MDP) framework, where an agent is in a state x ∈ X and takes an action u ∈ U. In this paper, we concentrate on the episodic case of hierarchical RL, where a policy π(τ|x0) generates a trajectory τ for a given initial state x0 and only one selected policy is executed until the end of the episode. After every episode, the agent receives the return of the trajectory, given by the sum of immediate rewards

R(\tau, x_0) = \sum_{t=0}^{T} r_t,

where T is the length of the trajectory, which is a random variable. In the following, we denote by s = x0 the initial state of the episode, which is often referred to as the "context", and we assume that a trajectory τ contains all the information about the actions u_t and the states x_t during the episode. The purpose of policy search is to obtain the policy π(τ|s) that maximizes the expected return (Deisenroth, Neumann, and Peters 2013)

J(\pi) = \iint d(s)\, \pi(\tau \mid s)\, R(s, \tau)\, \mathrm{d}\tau\, \mathrm{d}s, \quad (1)

where d(s) is the distribution of the context s. In hierarchical RL, we consider a policy that is given by a mixture of option policies; a standard form of this mixture is sketched below.
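For concreteness, the standard episodic option-policy mixture used in hierarchical policy search is written out below. This is an assumed form given for illustration; the paper's exact parameterization may differ.

% Assumed standard form, not quoted from the paper: the gating policy
% \pi(o | s) selects an option o for the context s, and the selected
% option policy \pi(\tau | s, o) generates the trajectory.
\pi(\tau \mid s) = \sum_{o} \pi(o \mid s)\, \pi(\tau \mid s, o)
% Substituting into Eq. (1), the expected return becomes
% J(\pi) = \iint d(s) \sum_{o} \pi(o \mid s)\, \pi(\tau \mid s, o)\, R(s, \tau)\, \mathrm{d}\tau\, \mathrm{d}s .

To make the core idea of return-weighted density estimation concrete, the following minimal Python sketch (NumPy only) evaluates a kernel density estimate in which each sample is weighted by a transformation of its return. It is an illustration under assumptions, not the paper's implementation: the exponential weighting exp(beta * R), the fixed kernel bandwidth, and the toy one-dimensional return landscape are hypothetical choices made for this example.

import numpy as np

def return_weighted_density(params, returns, query, bandwidth=0.3, beta=5.0):
    """Weighted Gaussian KDE whose sample weights are monotone in the return.

    params  : (N, D) sampled trajectory/policy parameters
    returns : (N,)   episodic returns R(tau) of those samples
    query   : (M, D) points at which the density is evaluated
    """
    # Return-weighted importance weights; subtracting the max keeps exp() stable.
    w = np.exp(beta * (returns - returns.max()))
    w /= w.sum()
    # Weighted Gaussian kernels evaluated at the query points.
    diffs = query[:, None, :] - params[None, :, :]            # (M, N, D)
    sq_dist = np.sum(diffs ** 2, axis=-1) / (2.0 * bandwidth ** 2)
    return np.exp(-sq_dist) @ w                               # (M,)

# Toy example: a 1-D "parameter space" whose return has modes near -2 and +2.
rng = np.random.default_rng(0)
theta = rng.uniform(-3.0, 3.0, size=(500, 1))
R = np.exp(-(theta[:, 0] - 2.0) ** 2) + 0.8 * np.exp(-(theta[:, 0] + 2.0) ** 2)
grid = np.linspace(-3.0, 3.0, 121)[:, None]
density = return_weighted_density(theta, R, grid)
print(grid[np.argmax(density), 0])   # peaks near the dominant return mode at +2

Modes of the return-weighted density roughly coincide with modes of the return function, so fitting a mixture model to the weighted samples, rather than evaluating a grid KDE as above, would yield one component per mode, and each component could serve as the region handled by a separate option policy. This last step is again only a sketch of the general idea.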
Similar resources
Estimation of Return to Scale under Weight Restrictions in Data Envelopment Analysis
Return-to-Scale (RTS) is one of the most important topics in DEA, yet many methods for estimating RTS in DEA have not been obtained. This paper develops the Banker-Thrall approach to identify the RTS situation for the BCC model in "multiplier form" with virtual weight restrictions imposed on the model by DM judgments. Imposing weight restrictions on DEA models often creates the problem of infeasibili...
Weighted-HR: An Improved Hierarchical Grid Resource Discovery
Grid computing environments include heterogeneous resources shared by a large number of computers to handle data- and process-intensive applications. In these environments, the required resources must be accessible to Grid applications on demand, which makes resource discovery a critical service. In recent years, various techniques have been proposed to index and discover the Grid resource...
Weighted Likelihood Policy Search with Model Selection
Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed to achieve an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, there still remains an unsolved problem in DPS related to model selection for the policy. In this paper, we propose a novel DPS method, weig...
Efficient Bregman Range Search
We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tas...
Bayesian change point estimation in Poisson-based control charts
Precise identification of the time when a process has changed enables process engineers to search for a potential special cause more effectively. In this paper, we develop change point estimation methods for a Poisson process in a Bayesian framework. We apply Bayesian hierarchical models to formulate the change point where there exists a step change, a linear trend and a known multip...
Journal: CoRR
Volume: abs/1711.10173
Pages: -
Publication date: 2017